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Abstract: 

Studying the topology of so-called real networks, that is networks ob- 
tained from sociological or biological data for instance, has become a major 
field of interest in the last decade. One way to deal with it is to consider 
that networks are built from small functional units called motifs, which can 
be found by looking for small subgraphs whose numbers of occurrences in 
the whole network of interest are surprisingly high. In this paper, we pro- 
pose to define motifs through a local over-representation in the network and 
develop a statistic which allows us to detect them limiting the number of 
false positives and without time-consuming simulations. We apply it to the 
Yeast gene interaction data and show that the known biologically relevant 
motifs are found again and that our method gives some more information 
than the existing ones. 
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1. Introduction 

Recent work indicates that biological networks show recurrent small patterns, 
called network motifs and introduced in Milo ct al (2002). They can be thought of 
as small units of given function from which the networks are built. For instance, 
Alon (2007) describes the regulation role in transcriptional networks of a pattern 
of three vertices called the feed-forward loop. 

In order to reach a better understanding of the structure of biological net- 
works, it is then quite natural to ask which are the small patterns that are 
over-represented in those networks. Many attempts were made to answer that 
question. The approach of Milo et al (2002) simulates the distribution of the 
number of occurences of a subgraph using a huge number of random networks 
with the same degree distribution as the observed one. Others, for example those 
of Kashtan et al (2004) or Wernicke and Raschke (2006), rely on the calculation 
of a Z-scorc. Nevertheless, the distribution of the number of occurrences of small 
subgraphs is more heavy-tailed than a gaussian distribution and therefore those 
methods may lead to false positives. 
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To avoid simulations, one has to define a probabilistic model of random graph 
generation and to compute the p- value of the observed number of subgraphs in 
that model. 

That approach leads to a first step which is the choice of a random model 
to describe biological networks and which is simple enough to deal with com- 
binatorics on it. The simplest random graph model is the model introduced by 
Erdos and Renyi (1959), that is a graph on a fixed number of vertices whose 
edges are sampled as independent identical Bernoulli variables. 

Nevertheless, such random graphs do not show the heterogeneity of biological 
networks. Many more realistic models were therefore developped (e.g. MoUoy 
and Reed (1995), Albert and Barabasi (1999), Birmele (2008)). In particular, 
mixture models were studied, for example in Nowicki and Snijders (2001) or 
Daudin et al (2008). The main advantage from the latter is to give rise to 
heterogeneous graphs but which still depend on independant Bernoulli trials. 

Picard et al. (2008) propose to take a mixture model as the null model. As 
they can't compute the law of the number of occurrences of small subgraphs 
under that model, they propose to fit the best possible Polya-Aeppli distribu- 
tion to the subgraph distribution and to take the p- value of that approximate 
distribution. 

As pointed out in Milo et al (2002), another issue is that a small graph can 
appear as over-represented because it contains an over-represented subgraph, 
which is in fact the biological relevant structure. Moreover, Dobrin et al. (2004) 
show that the motifs in the yeast transcriptional regulatory network aggregate. 
For both reasons, we will consider a new definition of a motif: given a small 
graph m and a fixed occurence in the network of one of its subgraphs m', we 
will look for an over-representation of the number of occurences of m extending 
the given occurence of m'. In other words, a motif will be defined by a local 
over-representation rather than by a global one. 

We propose to tackle the problem of the computation of the p-value by con- 
sidering a model with independant edges, each one appearing with a given con- 
nectivity probability. Moreover, to avoid a bias due to the use of an approximate 
distribution, we will look for an upper bound of the real p-value. That conser- 
vative approach in motif detection induces a better limitation of the number of 
false positives. 

To do so, we will use concentration inequalities. This term is used to charac- 
terize inequalities which bound the probability of a random variable to be far 
away from its expectation. The most widely known is Chebychev's inequality 
which states that, for every random variable X and positive real t, 

F{\X~EX\>t)<'-^ 

Nevertheless, an abundant literature exists to show that under some condi- 
tions, the bound of F{X — EX > t) decreases in fact exponentially with respect 
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to t. The term concentration inequalities expresses that X is heavily concen- 
trated around its mean value. Surveys and recent work on such inequalities can 
be found for instance in Mc-Diarmid (1998), Boucheron et al (2003) or Kim and 
Vu (2004). We will use such an inequality to show that the number of motifs 
aggregating on a given submotif is highly concentrated. 

Given a motif m and one of its submotifs m', we will then be able to give an 
upper-bound for the probability of seeing, anywhere in the network, an occur- 
rence of m' which extends to a high number of occurrences of m. 

In the following, we will consider directed networks. A network is then a 
graph G = (V, E) of vertex set V and edge set E. We suppose that there are 
no multiple edges in the same direction but opposite edges between two vertices 
and self-loops are allowed. 

For any vertex set U CV, we denote by G[U] the induced subgraph of G on 
the set U, that is the graph of vertex set U where each edge is present if and 
only if it is present in G. If U is an ordered set, the vertices of G[U] are ordered 
in the same way. 

The paper is organized as follows: Sections 2 and 3 respectively detail the 
definition of a motif and the random graph model which will be used throughout 
the paper. The main result is given and proved in section 4. Finally, in Section 5, 
we apply our method to the transcriptional Yeast gene interaction network, 
which was already studied by Milo et al (2002) using the Mfinder tool. We show 
that the known over-represented motifs of size 3 and 4 of that network are found 
again and that the knowledge of the submotifs with respect to which they are 
over-represented allows to sort non-relevant motifs and to distinguish the role 
of the vertices in the relevant ones. Moreover, our method doesn't need any 
simulation and is therefore significantly faster than the existing ones. 

2. Network motifs and submotifs 

A network motif m of size fc is a directed graph on k vertices, with possible 
loops and opposite edges, that is that the two edges uv and vu between two 
vertices u and v may occur simultaneously. 

An ordered motif is a network motif whose vertices are ordered. Let mg and 
mi be two ordered motifs of size k and let (ui, . . . , Uk) and {vi, . . . , Vk) be their 
respective ordered vertex sets. We will say that mo and mi are isomorphic as 
ordered motifs and we will note mi m2 if, for every pair of integers {i,j), 
UiUj is an edge of mg if and only if ViVj is an edge of mi . 

Figure 1 shows two ordered motifs that are not isomorphic as ordered motifs, 
despite of the isomorphism of the underlying non-ordered motifs. 

The ordering of the motifs and the notion of ordered isomorphism are in- 
troduced for convenience in the latter formulas, but they do not influence the 
over-representation of a motif as we shall see below. 
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Fig 1. Isomorphic motifs that are not isomorphic as ordered motifs. The numbers represent 
here the orders of the vertices. 

Indeed, let m be a motif on k vertices and denote by 7V(m) the number of dis- 
tinct occurences of the motif in G. Let aut{'m) be the number of automorphisms 
of m. Taking one ordering of the motif, aut{in) can be seen as the number of 
orderings of the vertices which give rise to ordered motifs isomorphic to the first 
one. For example, if m is the motif on three vertices with two edges starting 
from the same vertex, the ordered motif mo in Figure 1 gives an ordering of 
the motif m. The only other isomorphic ordering is obtained by exchanging the 
labels 1 and 3, so aut{m.) = 2. 

Let us order the vertices of m to obtain an ordered motif nno- Then, describing 
all the ordered sets of k vertices in G and looking for an ordered isomorphism 
with nio, each occurence of m is counted aut{m.) times and finally 



for any choice of an ordered version mo of m. 

Let m be a motif on k vertices. A submotif of m is then a subgraph of m. In 
this article, we only consider the submotifs obtained by suppressing one vertex 
in m. That approach gives raise to a partition of the vertices of m into deletion 
classes, two vertices r and s of m being in the same deletion class if and only if 
they play the same topological role in m. In other words, r and s are in the same 
class if, given an ordering mo of m, the ordered motif obtained by interverting 
the labels of r and s is isomorphic to mo. 

If two vertices are in the same deletion class, the submotifs obtained by 
deleting r and s are isomorphic. Nevertheless, the opposite may not be true. 
Indeed, consider the feed- forward loop motif shown in Table 1. Deleting any 
of the three vertices leads to a single edge but all the vertices are in different 
deletion classes as they are not topologically equivalent (they have for instance 
different out-degrees). 

3. The random graph model 

Concentration inequalities are powerful tools but mostly apply on functions of 
independent random variables. Moreover, they make use of the expectation of 




(2.1) 
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the function of interest. Therefore, we have to consider random graph models 
satisfying the following conditions, which will also turn to be sufficient for our 
purpose: 

(i) All the edges are drawn independently. 

(ii) The expectation of any motif count is tractable. 



Let us consider the random graph model on n vertices defined by a connec- 
tivity matrix C = (cy )i<i.j<„ and such that all the edges are sampled indepen- 
dently under Bernoulli laws: 

It is straightforward under that model to derive an expression for the mean 
number of occurences of a given ordered motif mo of size k. 

Indeed, let 1, . . . , fc be the vertices of mo and denote by e^s the indicator 
function for the edge from vertex r to vertex s in m. Let (ii, . . . be any 
ordered list of k distinct vertices of G. Then 



P(G[(^l,...,^fe)] ~omo) = [] <X(l-c,,,;J' 

l<r,s<k 

Therefore, by Equation (2.1), one has 



That model has no practical interest as it has as many parameters as possible 
edges but it is a very general framework in which our theory holds. In practice, 
we will be able to apply our concentration results on all special cases of that 
model, which include the Erdos-Renyi model (1959) and the Expected Degree 
Model of Matias et al (2006). The general model introduced by Bollobas et 
al. (2007) can also be used when restricted to the case of a fixed sequence of 
vertices. The most interesting family of models which fits in that frame is the 
one of Mixture Models (e.g. Nowicki and Snijders (2001), Airoldi et al (2008), 
Daudin et et al (2008), Hofman and Wiggins (2008)) where the classes of the 
vertices are fixed. 



4. Concentration of Network Motifs 

Let m be an ordered motif on k vertices and m' the sub-motif of m on — 1 
vertices obtained by deleting a vertex s from a given deletion class of m. Let 
mg be an ordering of m' and denote by (ri, . . . ,rk-i) the ordered vertices of 
mg. mo then denotes the ordering (ri, . . . , Vk-i, s) of m. 

Consider any ordered set U = (ui, . . . ,Uk-i) of fc — 1 vertices in V{G) and 
define the random variable Nij{m.) as the number of copies of m in G whose 
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restriction on U is isomorphic to mo as an ordered motif (see Figure 2 for an 
example). Our aim is to bound the tail of the distribution of Ni;{m.). 




Fig 2. For U = (b,c,d), Yuirn'o) = 1 and Nijirn) = 3. Indeed, one obtains valid extensions 
of m'fy to mo by adding to U the vertices g, h or i. Adding the vertex f does not give rise to 
a valid extension because of the presence of an edge from c to f. 

Let US denote by Yij{ml^) the indicator of an occurence of mQ on U as an 
ordered motif, that is 

F(7(mo) = lG[C/]~om^- 

For each vertex v ^ U, let exi(j(mQ, mo) be the indicator of the fact that the 
extension from G[U] to G[U U {v}] is isomorphic to the extension of mj, to mo 
with respect to the order of the vertices. Denoting by Cab the indicator function 
for the edge from vertex a to vertex b in mo, 

ea;t^(mo, mo) = 1 4^ Xyy = Bgs and Vi, = 6^,^ and Xyu, = e^r; 

Note that ext'^{mQ, mo) only depends on the edges between v and U and is 
therefore independant oiYij{m.Q). Moreover, if G[U] corresponds to an occurence 
of mo and that ext'^{mQ, mo) = 1, G[U U {v}] corresponds to an occurrence of 
mo. In that case we say that the occurence of mo on J7U {v} is a valid extension 
of the occurence of vtiq on U. 

Moreover, as A^c/(m) is null when there is no occurence of mj, on U and as, 
if mo occurs on U, Ni;{ni) counts the number of its valid extensions, we have 
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Nu{m) = Yuim'o) e^tui<^ ^o) (4.1) 

Bounding the tail of the distribution of -/Vj/(m) can be done by using the 
fact that either Yi;{mQ) = and iV[/(m) is null or Yu{in'Q) = 1 and iV[/(m) 
is a sum on independant Bernouilli trials. In the latter case, changing one of 
the Bernoulli trials affects Nij{ni) by at most 1. Numerous results show that 
random variables showing such characteristics are highly concentrated around 
their mean value. One of them is the following theorem (Mc-Diarmid (1998)): 

Theorem 4.1. Let the random variables Xi, . . . ,X„ be independent, with < 
Xk < 1 for each k, and let S'„ = ^ Xk- Then, for every t > 0, 



>t] < e' 



We will use this theorem to bound the p- value of seeing, at any position U , 
an occurrence of iiIq having a number of valid extensions to mo which is large 
with respect to the expected number of valid extensions. 

In that purpose, let us denote by Extjj{iLnQ,ino) the mean number of valid 
extensions of a putative occurrence of mg on U, that is 

Extu{m'Q, mo) = E ^ ext^{m'Q, mo) 

For the following, in order to simplify the notations, wc will write Extjj 
instead of Extij{m'i^, mo) when there is no ambiguity. 

The main result of the paper is now the following theorem, which shows that 
the probability of finding a position U such that Nij{m) is much larger than 
Extjj decreases exponentially: 



Theorem 4.2. Let h be the function defined on [0, +cx3[x]0, +oo[ by 
h{X,Y) = 



ifX<Y 

■ X ' 



X In(^) + Y otherwise 



Then, for every t > 0, 



P{ma.x{h{Nir{m), Extu)) > t) < aut(m')E7V(m')e~* 

Note that for a fixed value oiYjhy-X-^ h{X, Y) is an increasing function, 
growing asymptotically as X\n(X), as illustrated in Figure 3. 

Therefore, the inequality of Theorem 4.2 bounds the probability of seeing 
any large -/Vc/(m), as 

max{h{Nu{m),Extu)) > t ^ 3U/ h{Nu{m.), Extu) > t 
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Fig 3. Curves of the function h for fixed values of Y 



To illustrate this, consider the particular case where the probabilistic event 
considered is {3U/Nu{m) > e^Extjj + t], for some positive t. It is straightfor- 
ward to see that 

Nu{ra) > e^Extu + t In(^M^) > 1 

eExtu 

=^> h{Nu{m.),Extu) > Nu{m) 
h{Nuim.),Extu) > t. 

The following corollary of Theorem 4.2 follows: 
Corollary 4.1. For every t > 0, 

F{3U/Nu{m) > e^Extu + t) < aut(m')IEA^(m')e"* 



In order to prove Theorem 4.2, let us first introduce a new function g and a 
technical result which will be useful later on: 

Lemma 4.1. Let g be the function defined by 

[0, +oo[ [0,+oo[ 

^' y ^ (l+2y)ln(l+y)-y 

i) g is increasing 

ii) g is one-to-one 
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iii) For every X > 0, Y > 0, 

'X -Y 



9 



Proof, i) and ii) are straightforward. To prove iii), let us consider X > Q and 
y > 0. Suppose that X <Y. Then 

ff(niax(^^,0))r - g{Q)Y = = h{X,Y). 

On the other hand, \i X >Y , 

g(max(^^,0))r = 9{^^)Y 

= XH^)-X + Y 
= h{X,Y). 

□ 

To prove Theorem 4.2, which is a result giving an exceptionality measure on 
the whole graph, we start by showing a local exceptionality result. To do so, we 
fix a position U and show the following result: 

Lemma 4.2. For every t > 0, 

^{h(Nu{m),Extu) >t)< PiYuini'a) = 1)6"*. 

Proof. Let us first decompose the event of interest by looking for an occurence 
of m' on i7: 



\h{Nu{m), Extu) >t) = V{h{Nu{m), Extu) > t\Yu(m'^) = l)P{Yu{m'a) = 

+P{hiNu{ni),Extu) > t\Yu{ni'a) = 0)F{Yu{ni'o) 

If Yuim'^) = 0, then iV,7(m) = and thus h{Nu{m), Extu) = 0. Therefore, 



\h{Nuini),Extu) >t)^ V{h{Nu{m),Extu) > AYuim'^) = l)¥{Yu{ra.'^) = 1). 

(4.2) 



Defining y^g ^{^^) , we have: 
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\h{Nu{m),Extu) >t\Yu{m'„) = l) 
= P(g(max( ^^^'^'' ~ '^^^^ , 0))Extu > t |yc/(m^) = l) by Lemma 4.1 iii) 

= P(g(max(^ii-^L_ 0)) > giy) \Yuini',) = l) 

= P(max( ^^^™j '^^^^^ ,0) > y inKm;,) = l) by Lemma 4.1 i) 

-Exit/ 



> y |y(7(m;,) = 1) asy>0 

> y \Yu{ni'o) = l) by using (4.1) 



Extjj 
^ Extu 

where the last equation is due to the fact that each ea::t^(mo, mo) is inde- 
pendant from Yuirn'o). 

Theorem 4.1 can be now appUed to the random variables (exij^(mo, mo))^^^^ 
and therefore imphcs that 

p( E.gaexr^(m;i,mo)-£;xtt/ ^ ^ ^-giy)E.t^ 
Extu 

that is, as g{y)Extu — t, 

¥{h{Nuini),Extu) > t\Yuini'o) = l) < e"* (4.4) 
Using Inequahty (4.4) in Equation (4.2) yields Lemma 4.2. □ 
To prove the main theorem, one has to note that 



u 

Therefore, 



{m&x{h{Nu{m),Extu) > t)} = [j{h{Nu{m), Extu) > t} 



'max(/i(iV[/(m),£;x<c/)) > t) < ^F{h{Nu{m), Extu) > t) 
^ u 

< ^P(n/(m^) = l)e-* 

u 

< aut{m')EN{m')e-* by using (2.1) 
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5. Application to the transcriptional gene regulation network of 
Yeast 

The results of Section 4 allow us to decide if a motif m is considered as being 
over-represented in a real graph with respect to at least one of its submotifs. To 
do so, we use the following steps: 

• Choose a random graph model and estimate its parameters on the real 
graph. 

• Choose a deletion class in m and consider the obtained submotif m'. 

• List all the occurrences of m and thus all the occurences of m' having at 
least one valid extension. Then, for every occurence nig in that list, count 
the number of its valid extensions Nir{m) and compute the mean value 
E(iV[/(m)), where U is the vertex set of mg. 

• Compute the bound of the p- value using Theorem 4.2. Having listed all 
positions U of interest and the associated values of Njj (m) allows the user 
to have access to the positions where the local concentration takes place. 

The first step needs to be done only once when dealing with a whole list of 
motifs whereas the three last ones have to be done for every deletion class of 
every motif of interest. 

This method is applied to the gene regulation network of Yeast, available 
at Uri Alon's Lab website (http://www.weizmann.ac.il/mcb/UriAlon/). That 
network has 688 vertices and 1078 edges and a directed edge between genes gi 
and 92 denotes a regulation from gene gi on the expression of gene g2. We don't 
take the type of regulation into account here, that is if it is an activation or an 
inhibition. 

The random graph model we consider is the Mixture Model with fixed classes. 
In that model, the n vertices are spread into Q classes. 

We consider a matrix n = {Trqr)i<q,r<Q which gives the connection proba- 
bilities between classes and for each pair {i^ j) of vertices, we define Cij — '^qr-j 
q and r being the respective classes of i and j. In other words, the connectivity 
probability between two vertices only depends on their mutual classes. 

To estimate the partition of the graph and the corresponding matrix 11, we 
use the algorithm developped by Latouche and al. (2008), and assign each vertex 
to its most probable class. 

Table 1 shows the unique motif of size three detected. The column p-value 
bound contains the upper bound of the real p-value given by Theorem 4.2 for t = 
maxjj {h{Nu (in), Extjj)), whereas the Agglomeration coefficient is the maximal 
observed value of Nu{m.). 

That motif, called the feed-forward loop, is known (see Alon (2007) for a 
deeper biological insight) to play a role in regulation processes by inducing a 
delay in the regulation of Z by A: X has first to activate Y and then X and Y 
together regulate Z. Note that X, Y and Z are not in the same deletion class, 
as mentioned in Section 2, and there are therefore three different submotifs to 
take into account, even if each of them is a single edge. 
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Our analysis finds the feed-forward loop to be over-represented and shows 
that there are genes X using the same gene Y to regulate a high number of 
genes Z (up to 15). On the contrary, there is no gene X using a high number 
of intermediates Y to regulate a given gene Z. 



Motif 


Deletion class 


p-value bound 


Agglomeration 




X 


5.15e - 3 


3 


xX^ ^^ \ z 


Y 


3.33 


2 




Z 


1.12e - 11 


15 



Table 1 

p-values for the three possible sub-motifs of the feed-forward loop in the Yeast regulation 

network 

Table 2 shows all the motifs of size four which are found to be over-represented 
with respect to at least one of their submotifs and with a bound on the p-value 
lower than 10~^ . 

The motif of size four with the lowest p-value is the bi-fan, that is the first 
motif shown in table 2. That motif consists in two regulators having an impact 
on two common genes. It was first shown to be over-represented in that network 
by Milo and al (2002) and appears first in all motif detection algorithms. 

Nevertheless, our approach gives a supplementary information, that is that it 
is highly over-represented with respect to the sub-motif obtained by suppressing 
one of the regulated genes. In other words, there exist in Yeast co-regulators 
which co-regulate a high number of genes simultaneously. The agglomeration 
coefficient shows that they may act on up to 37 common genes. On the other 
hand, the bi-fan is not over-represented with respect to the sub-motif obtained 
by suppressing one of the regulators, that is there exist no couple of genes that 
are influenced by a high number of common regulators. In fact, running the 
algorithm in that case leads to a p-value of .18 and an agglomeration coefficient 
of 4. That assymetry between couples of regulators and couples of regulated 
genes is well known by biologists but is found again here only by statistical 
means. 

Note also that among the six detected motifs, the third and the two last ones 
are in fact by-products of the over-representation of the feed-forward loop: they 
are not overrepresented with respect to their feed-forward loop submotif, that 
is the one obtained by deleting T. 

From a computational point of view, we obtain a significative improvement 
in terms of running time. Indeed, available methods are of two kinds: either they 
use a reasonable number of simulations and use a Z-score and are quite rapid 
but may lead to false positives, or they make use of a huge number of simulated 
graphs to obtain an empirical p-value but are very time-consuming. The part 
of our method which takes the most time is in fact the preprocessing by the 
algorithm of Latouche et al (2008) to estimate the parameters of the mixture 
model. The concentration part is very rapid as it only needs to compute expec- 
tations, which are easy to calculate in mixture models. Running our algorithm 
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and the empirical p- value option of the MFinder tool (Milo and al (2002)) for 
the graphs of size four in the Yeast network divides the running time by a factor 
100 (less than 10 minutes compared to more than 15 hours on a dual core). 



6. Conclusion 

We develop a new method to detect network motifs by looking for local over- 
representations. To do so, we show that, given an occurence of a submotif, the 
number of motifs extending it is highly concentrated around its mean value. That 
approach allows to consider the over-representation with respect to a submotif 
and to avoid time-consuming simulations. 
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Table 2 

Over-represented motifs of size four in the Yeast regulation network 
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